Add xarray Dataset/DataArray support to ACCESS_ESM_CMORiser #145
Conversation
- Add new `input_data` parameter to accept xarray Dataset or DataArray objects
- Maintain full backward compatibility with the existing `input_paths` parameter
- Automatically convert DataArrays to Datasets for processing
- Skip frequency validation for xarray inputs (data already loaded)
- Update all CMORiser subclasses (Atmosphere, Ocean OM2/OM3) to support the new interface
- Preserve all existing functionality (resampling, chunking, validation)
- Add comprehensive parameter validation and deprecation warnings

This enables in-memory processing workflows and integration with xarray-based analysis pipelines while maintaining compatibility with existing file-based workflows. All existing tests pass (34/34), confirming no breaking changes.
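A minimal sketch of the parameter handling the bullets describe. The helper name `resolve_input` and the exact error message are assumptions for illustration, not the PR's actual code; only the behaviour (mutually exclusive parameters, DataArray promoted to Dataset) comes from the description above.

```python
import xarray as xr


def resolve_input(input_paths=None, input_data=None):
    """Hypothetical sketch of the PR's input validation and conversion."""
    if (input_paths is None) == (input_data is None):
        # Exactly one of the two inputs must be supplied
        raise ValueError("Provide exactly one of input_paths or input_data")
    if input_data is not None:
        if isinstance(input_data, xr.DataArray):
            # DataArrays are promoted to single-variable Datasets
            return input_data.to_dataset(name=input_data.name or "data")
        return input_data
    # Fall back to the existing file-based path
    return xr.open_mfdataset(input_paths)
```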
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main     #145     +/-  ##
========================================
- Coverage   57.55%   52.35%   -5.21%
========================================
  Files          18       18
  Lines        2403     2722    +319
========================================
+ Hits         1383     1425     +42
- Misses       1020     1297    +277
```
rhaegar325
left a comment
Thanks @rbeucher for this PR, the overall structure is very clear and the implementation is well organised. The design aligns nicely with the existing CMORisation workflow, and after some basic testing I didn’t observe any obvious bugs.
There are a few edge cases around bounds handling that might be worth keeping in mind for future improvements:
For ocean variables, the current `time_bnds` handling is not yet fully covered. When `input_data` is provided as a `DataArray`, `time_bnds` can be missing, which may lead to runtime errors. In practice, this means the workflow relies on a helper function that automatically derives `time_bnds` from `time`.
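A helper of the kind described might look like the following; the function name and the midpoint-based interior edges are assumptions, since the PR does not show the helper itself:

```python
import numpy as np


def derive_time_bounds(time):
    """Hypothetical helper: infer contiguous time bounds from cell midpoints."""
    time = np.asarray(time, dtype=float)
    mid = (time[:-1] + time[1:]) / 2.0  # interior edges between samples
    step0 = time[1] - time[0]           # extrapolate first/last edges
    step_n = time[-1] - time[-2]
    edges = np.concatenate(
        ([time[0] - step0 / 2.0], mid, [time[-1] + step_n / 2.0])
    )
    # Pair consecutive edges into (ntime, 2) bounds
    return np.stack([edges[:-1], edges[1:]], axis=1)
```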
A similar pattern appears for atmosphere variables with spatial bounds. For example, if bounds variables are not present, errors such as

```
KeyError: "No variable named 'lon_bnds'. Variables on the dataset include ['lon', 'time', 'lat', 'pr']"
```

can occur.
Additionally, if the dataset is loaded without `decode_cf=False`, time coordinates may be decoded as cftime objects, which can trigger a `TypeError` in `atmosphere.py` (around line 174) when they are implicitly cast to numeric types:

```
TypeError: float() argument must be a string or a real number, not 'cftime._cftime.DatetimeProlepticGregorian'
```
These don’t look like blockers for this PR, but it might be useful to document these assumptions or add some lightweight safeguards around bounds generation and time handling in a follow-up.
```
)
if resampling_required:

# Keep only required data variables
```
Dropped redundant coordinates and dimensions to prevent them from affecting other parts of the workflow. These steps were previously handled implicitly by `xr.open_mfdataset()` and now need to be handled explicitly.
```
used_dims.update(self.ds[var].dims)

# Exclude auxiliary time dimension
if "time_0" in used_dims:
```
`time_0` is a special coordinate and needs to be handled specifically.
```
self.ds = self.chunker.rechunk_dataset(self.ds)
print("✅ Dataset rechunking completed")

def _ensure_numeric_time_coordinates(self, ds: xr.Dataset) -> xr.Dataset:
```
Method to convert cftime-formatted time values to numeric values.
```
return ds_resampled, True

def calculate_time_bounds(
```
Three methods for calculating `time_bnds`, `lat_bnds`, and `lon_bnds`.
```
)  # Make a copy to avoid modifying original

# SAFEGUARD: Convert cftime coordinates to numeric if present
self.ds = self._ensure_numeric_time_coordinates(self.ds)
```
Added a safeguard to handle cases where the input data uses cftime in time-related coordinates and variables.
Add automatic calculation of missing coordinate bounds and fix cftime handling for CMIP6 compliance

This PR implements automatic calculation of missing coordinate bounds and fixes cftime handling for CMIP6 compliance.

Changes Made

1. cftime to numeric conversion safeguard
2. Coordinate cleanup improvements
3. New utility functions in `utilities.py`
rhaegar325
left a comment
1. cftime to numeric conversion safeguard

- Added `_ensure_numeric_time_coordinates()` method to convert cftime objects to numeric values when datasets are loaded with `decode_cf=True`
- Prevents `TypeError: float() argument must be a string or a real number, not 'cftime._cftime.DatetimeProlepticGregorian'` in downstream operations
- Preserves time encoding attributes (`units`, `calendar`) after conversion
- Uses default `units='days since 0001-01-01'` when missing
2. Coordinate cleanup improvements

- Implemented proper removal of unused coordinates in `load_dataset()`
- Added handling for auxiliary dimensions (e.g., `time_0`) via `isel(time_0=0, drop=True)`
- Only keeps coordinates that are actually used as dimensions by data variables
- Prevents dimension mismatch errors in transpose operations
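The cleanup described above might be sketched as follows; the helper name `drop_unused_coords` is hypothetical, and only the behaviour (collapsing `time_0`, keeping coordinates whose dimensions are used by data variables) comes from the bullets:

```python
import xarray as xr


def drop_unused_coords(ds: xr.Dataset) -> xr.Dataset:
    """Hypothetical sketch of the coordinate cleanup."""
    # Collect every dimension actually used by a data variable
    used_dims = set()
    for var in ds.data_vars:
        used_dims.update(ds[var].dims)
    # Collapse the auxiliary time dimension if nothing uses it
    if "time_0" in ds.dims and "time_0" not in used_dims:
        ds = ds.isel(time_0=0, drop=True)
    # Drop coordinates tied to dimensions no data variable uses
    unused = [c for c in ds.coords if not set(ds[c].dims) <= used_dims]
    return ds.drop_vars(unused)
```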
3. New utility functions in `utilities.py`

- `calculate_latitude_bounds()`: Calculates latitude bounds for both regular and irregular grids, with proper handling of polar boundaries (clipping to [-90°, 90°])
- `calculate_longitude_bounds()`: Calculates longitude bounds supporting both 0-360° and -180-180° conventions, with automatic detection of global vs. regional grids and proper handling of periodic boundaries
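A minimal sketch of what `calculate_latitude_bounds()` might do for a 1-D axis, assuming midpoint bounds with polar clipping as described; the PR's actual implementation (which also handles irregular 2-D grids) is not shown here:

```python
import numpy as np


def calculate_latitude_bounds(lat):
    """Sketch: midpoint bounds for a 1-D latitude axis, clipped to [-90, 90]."""
    lat = np.asarray(lat, dtype=float)
    mid = (lat[:-1] + lat[1:]) / 2.0
    # Extrapolate the outermost edges symmetrically, then clip to the poles
    first = lat[0] - (mid[0] - lat[0])
    last = lat[-1] + (lat[-1] - mid[-1])
    edges = np.clip(np.concatenate(([first], mid, [last])), -90.0, 90.0)
    return np.stack([edges[:-1], edges[1:]], axis=1)
```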
4. Enhanced `calculate_time_bounds()` function

- Added `time_coord` parameter to support different time coordinate names
- Added `bnds_name` parameter to support different bounds dimension names (`"nv"` for ocean, `"bnds"` for atmosphere)
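The call shape implied by these parameters could look like the sketch below. The signature follows the description above, but the interior logic (simple midpoint bounds, at least two time steps) is an assumption, not the PR's exact code:

```python
import numpy as np
import xarray as xr


def calculate_time_bounds(ds, time_coord="time", bnds_name="bnds"):
    """Sketch matching the described signature; ocean files use "nv"
    for the bounds dimension, atmosphere files use "bnds"."""
    t = ds[time_coord].values.astype(float)
    mid = (t[:-1] + t[1:]) / 2.0
    # Extrapolate the first and last edges symmetrically about the samples
    edges = np.concatenate(([2 * t[0] - mid[0]], mid, [2 * t[-1] - mid[-1]]))
    bounds = np.stack([edges[:-1], edges[1:]], axis=1)
    return xr.DataArray(bounds, dims=(time_coord, bnds_name),
                        coords={time_coord: ds[time_coord]})
```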
5. Updated `CMIP6_Atmosphere_CMORiser` class

- Added automatic detection of missing bounds variables during `select_and_process_variables()`
- Automatically calculates missing bounds using the appropriate utility function based on coordinate type
- Issues user warnings when bounds are missing from raw data and being auto-calculated
- Maintains flexibility for different coordinate naming conventions (lat/latitude/y, lon/longitude/x, time/t)
These changes allow `xarray.DataArray` objects to be used as an input format for Moppy.
Overview
This PR adds support for using xarray `Dataset` and `DataArray` objects as direct inputs to `ACCESS_ESM_CMORiser`, enabling in-memory processing workflows while maintaining full backward compatibility with existing file-based workflows.

Problem Statement
The current CMORiser assumes a list of files as input. This creates limitations for workflows where the data is already in memory, for example in xarray-based analysis pipelines.
Solution

New `input_data` Parameter

Added a new `input_data` parameter that accepts xarray `Dataset` or `DataArray` objects.

Usage Examples
With xarray Dataset
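A minimal example of building an in-memory Dataset to pass as `input_data`. The variable and coordinate names are illustrative, and the commented-out constructor call is a hypothetical shape based on the PR description, not verified against the repository:

```python
import numpy as np
import xarray as xr

# Minimal in-memory Dataset; names and values are illustrative
ds = xr.Dataset(
    {"pr": (("time", "lat"), np.zeros((2, 3)))},
    coords={"time": [0.5, 1.5], "lat": [-60.0, 0.0, 60.0]},
)

# Hypothetical call shape based on the PR description:
# cmoriser = ACCESS_ESM_CMORiser(input_data=ds, ...)
```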
With xarray DataArray
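A bare `DataArray` is also accepted; per the PR description it is converted to a single-variable Dataset internally. The sketch below shows that promotion explicitly; the commented-out constructor call is hypothetical:

```python
import numpy as np
import xarray as xr

# A named DataArray; the CMORiser promotes it to a Dataset internally
da = xr.DataArray(np.ones((2, 3)), dims=("time", "lat"), name="pr")
promoted = da.to_dataset(name=da.name)

# Hypothetical call shape based on the PR description:
# cmoriser = ACCESS_ESM_CMORiser(input_data=da, ...)
```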
Backward Compatibility
Existing code continues to work unchanged:
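For reference, the established file-based entry point stays as it was; the paths below are illustrative and the commented-out constructor call is a hypothetical shape, not verified against the repository:

```python
from glob import glob

# File-based workflow, unchanged by the new input_data option
paths = sorted(glob("/path/to/access-esm/output/*.nc"))

# cmoriser = ACCESS_ESM_CMORiser(input_paths=paths, ...)  # as before
```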